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Abstract 



In recent work, robust mixture modelling approaches using skewed distributions have been explored 
to accommodate asymmetric data. Wc introduce parsimony by developing skcw-t and skew-normal ana- 
logues of the popular GPCM family that employ an eigenvalue decomposition of a positive-semidefinite 
matrix. The methods developed in this paper are compared to existing models in both an unsupervised 
and scmi-suporviscd classification framework. Parameter estimation is carried out using the expectation- 
maximization algorithm and models are selected using the Bayesian information criterion. The efficacy 
of these extensions is illustrated on simulated and benchmark clustering data sets. 

1 Introduction 

The objective of cluster analysis is to organize data into groups wherein the similarity within groups and the 
dissimilarity between groups are maximized. A 'model-based' approach is one that uses mixture models for 
clustering. The use of finite mixture models has become increasingly common for clustering and classification, 
particularly with the use of Gaussian components. The Gaussian model-based clustering likelihood is 



where tTj > 0, such that J2i=i'^i = 1) mixing proportions and (/)(xj | is the density of a 

multivariate Gaussian random variable with mean /j.^ and covariance matrix Sj. Model-based classification 
is a semi-supervised version of model-based clustering (cf. Section 6) . 

Gaussian mixture models have been used for a wide variety of clustering applications, including work 
by McLachlan and Basford (1988), Bouveyron et al. (2007), McNicholas and Murphy (2008, 2010a,b), and 
Back and McLachlan (2010), amongst others. In efforts to accommodate data that exhibit some departure 
from normality, robust extensions are garnering increased attention. For instance, mixtures of multivariate 
f-distributions (McLachlan and Peel, 1998; Peel and McLachlan, 2000) have proven effective for dealing 
with components containing outliers. They have been the basis of a variety of robust clustering techniques 
that use mixtures of multivariate ^-distributions, including work by McLachlan et al. (2007), Andrews and 
McNicholas (2011a,b), Back and McLachlan (2011), Steane et al. (2012), and McNicholas and Subedi (2012), 
amongst others. 

Capturing components that are asymmetric can be tackled using skew-normal distributions (cf. Lin et al., 
2007) or other non-elliptically contoured distributions (e.g., Karlis and Santourian, 2009). One example of 
a non-elliptical distribution is the skew-normal independent distribution, as considered for finite mixture 
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modeling in Cabral et al. (2012). Another interesting alternative is the skew Student-f-normal distribution 
that has recently been used to model skewed heavy-tailed data (Ho, Pyne, and Lin, 2012). Although there 
arc many non-symmetric options available, we focus on mixtures of skew-normal distributions and mixtures 
of skew-i distributions herein. 

Recently, mixtures of multivariate skew-t distributions have been receiving some attention in the litera- 
ture. Of course, the skew-normal, t, and Gaussian distributions are all special cases of the skcw-i distribution. 
This property can be important in clustering applications because we often do not know the most appro- 
priate underlying distribution. Alongside the skewness parameter that accommodates asymmetric data, the 
degrees of freedom parameter allows for heavy tails, giving less weight to outlying observations in parameter 
estimation. Compared to work on parsimonious Gaussian and i-mixtures, the literature contains relatively 
little on parsimonious mixtures of multivariate skew- normal and skew-t distributions. The purpose of this 
paper is to go some way towards addressing this deficiency. 

The remainder of the paper is organized as follows. In Section 2, we introduce skew-t and skew-normal 
mixture models and briefly discuss the calculation of parameter estimates. Section 3 presents the construc- 
tion of parsimonious families of models that are analogues of popular Gaussian approaches. The proposed 
methods are compared with Gaussian and multivariate t analogues using simulation studies (Section 4) and 
four benchmark clustering data sets (Section 5). These models are extended further to semi-supervised 
classification in Section 6, which is followed by concluding remarks (Section 7). 



2 Mixtures of Skew-i and Skew-Normal Distributions 
2.1 A Mixture of Skew-i Distributions 

Although the computational tractability of Gaussian mixture models has contributed to their widespread 
popularity within the literature, their application is not always appropriate. For instance, Kotz and Nadara- 
jah (2004) argue that the multivariate-f distribution provides a more realistic model for real-world data and 
it has been noted (e.g., Lin et al., 2007) that Gaussian mixture models have a tendency to over fit skewed 
data. Therefore, it is natural to consider a single distribution, namely the multivariate skew-t distribution, 
that conflates the robust properties of the t-distribution with a skewness parameter to account for asymme- 
try. We outline a model-based approach using parsimonious mixtures of multivariate skew-t distributions; 
we also consider a mixture of skew-normal distributions, which is a limiting case. 

There are a number of 'skew-t' distributions within the literature. The version that we adopt, as defined 
by Pyne et al. (2009), uses a particular stochastic representation of the multivariate skew-t distribution 
given by Sahu et al. (2003). This distribution also corresponds to the so-called 'restricted' multivariate 
skew-t distribution, as defined in Lee and McLachlan (2012). Adopting this characterization, a random 
vector Y is said to follow a p-variatc skew-t distribution with location vector ^, scale matrix fl, skewness 
vector A, and v degrees of freedom if it has the representation 



where 



Y = X\U\+X, (2) 



andT4^~r(i//2,!//2). 

We consider a ^-component mixture of p-dimensional skew-t distributions with density given by 

a 

fivj I *) = Y,T^Myj I (3) 

i=l 
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where ^iUj \ is the density of a multivariate skew-f distribution with location vector scale 

matrix fij, skewness parameter Aj, and Ui degrees of freedom, and * contains all model parameters, i.e., ^ 
contains the parameters {wi, fij, A,, : i = 1,. . . ,g}. 

2.2 A Mixture of Skew-Normal Distributions 

Now consider the skew-normal distribution, which is a limiting case of the skew-i distribution. This version 
of the skew-normal distribution is the 'restricted' version of the skew-normal distribution defined by Sahu 
et al. (2003) and proposed for the analysis of flow cytometric data by Pyne et al. (2009). Resembling the 
above characterization (Section 2.1), a random vector Y is said to follow ap-variate skew-normal distribution 
with location vector ^, scale matrix CI, and skewness vector A if it has the representation 

Y = X\U\+X, (4) 

where 

We consider a (^-component mixture of p-dimensional skew- normal distributions. The density is given by 

g 

fiyj\&) = J27rMyj\^i,^i,>^i), (5) 

where <fi{yj \ ^j, f^j, Aj, !>'j) is the density of a multivariate skew- normal distribution with location vector 
scale matrix fij, and skewness Aj, and © contains all model parameters, i.e., contains {ttj, ^j, fl,, Aj : i = 
l,...,g}. 

2.3 Parameter Estimation 

Parameter estimation for mixtures of multivariate skew-normal distributions is carried out via an expectation- 
maximization (EM) algorithm (Dempster et al., 1977) as outlined by Pyne et al. (2009). However, maximum 
likelihood estimation for mixtures of multivariate skew-t distributions is more involved and has been tackled 
using a number of different variations of the EM algorithm. Vrbik and McNicholas (2012) derive closed form 
solutions for the skew-t random variable defined in (2), allowing for a traditional EM algorithm to be used. 
This aforementioned procedure is employed for our proposed study but multiple techniques are available. 
For example, a Bayesian approach is explored in Friihwirth-Schnatter and Pyne (2010), wherein an eflacient 
Markov chain Monte Carlo (MCMC) scheme is proposed for finite mixtures of multivariate skew-normal and 
skcw-f distributions. 

A nice overview of the differing characterizations of the skew-t distribution and their applications has 
recently been given by Lee and McLachlan (2012). Apart from providing an up-to-date account of the recent 
work with the skew-t distribution, they also develop some new results on the restricted and unrestricted 
multivariate skew-t distribution (rMST and MST, respectively). The E-step for mixtures of MST can be 
expressed in closed form apart from one term and, when using a one-step-late M-step, this term can be 
taken to be zero or be expressed in closed form. Lee and McLachlan (2012) offer an alternative method to 
calculate the intractable E-step for the rMST. This is accomplished by manipulating the expectations to 
be expressed in terms of the moments of the truncated non-central multivariate f-distribution (cf. Ho, Lin, 
Chen, and Wang, 2012). 

2.4 Comments 

The additional skewness parameter gives these models the advantage of being able to adjust for asymmetric 
data while the degrees of freedom in the skew-t case acts as a robustness tuning parameter to accommodate 
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heavier tails. The contour maps in Figure 1 illustrate various shapes attained by the bivariate skew-normal 
distribution, with teardrop-like as well as spherical and elliptically shaped clusters. The pictures are con- 
structed using a multivariate skew-normal distribution with = I to isolate the effect of skcwncss on 
otherwise elliptically shaped clusters. Note that a multivariate skew-t distribution with large degrees of 
freedom would appear similar, but we would see more in the tails with lower values of degrees of freedom. 
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Figure 1: Density contours from bivariate skew-normal distributions with ^ = 0, 17 = I, and varying values 
for the skewness parameter (A). 

3 Methodology 

3.1 Two Skewed Families 

Despite the fact that there has been support for model-based approaches for quite some time now, only 
in the last two or more decades has this approach become popular. Owing greatly to the advances in 
computing, model-based approaches are quickly becoming an attractive tool for clustering and classification. 
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For Gaussian model-based clustering, model fitting becomes an issue due to the number of parameters in 
the component covariance matrices: there are Gp{p + l)/2 in all, cf. (1). This same problem manifests for 
mixtures of multivariate t, skew-normal, and skew f-distributions. 

Celeux and Govaert (1995) introduced parsimony into Gaussian mixtures by imposing constraints on 
eigen-decomposed component covariance matrices. This parameterization, originally considered in Banfield 
and Raftcry (1993), is an eigenvalue decomposition of the component covariance matrices given by = 
AiDjAjD^, where Dj is the orthogonal matrix of eigenvectors of Sj, A, is the diagonal matrix of entries 
proportional to eigenvalues with |Aj| = 1, and Aj is the associated constant of proportionality. Celeux 
and Govaert (1995) developed a family of fourteen Gaussian parsimonious clustering models (GPCMs) by 
applying different constraints to this eigen-decomposed covariance structure (Table 1). A subset of these 
models make up the of family ten models available in the mclust package (Praley et al., 2012) for R (R 
Development Core Team, 2012). In addition to clustering applications, Gaussian mixtiue models with 
this covariance decomposition have been used for discriminant analysis (Bensmail and Celeux, 1996) and 
classification (Dean et al., 2006). 

Table 1: The nomenclature used in the MCLUST family and borrowed for the models developed herein, 
along with the decomposition constraints and number of free parameters in the decomposed matrix. Note: 
'E' denotes equal across groups, 'V denotes varying across groups, T denotes the identity matrix. 



Model 




Ai = A 


Dj = D 


Free Covariance/Scale Paramet 


EII 


E 


I 




1 


VII 


V 


I 




G 


EEI 


E 


E 




P 


VEI 


V 


E 




G + (p - 1) 


EVI 


E 


V 




Gp-(G- 1) 


VVI 


V 


V 




Gp 


EEE 


E 


E 


E 


b(p + i)/2] 


VEE* 


V 


E 


E 


b(p+l)/2] + (G-l) 


EVE*+ 


E 


V 


E 


[p(p+l)/2]-(G-l)(p-l) 


VVE*+ 


V 


V 


E 


[p(p + l)/2] + (G - l)p 


EEV 


E 


E 


V 


G\pip + l)/2]-iG-l){p) 


VEV 


V 


E 


V 


G\p(p + l)/2]-{G-l){p-l) 


EVV* 


E 


V 


V 


G[p(p l)/2] - (G - 1) 


VVV 


V 


V 


V 


G[p(p -Hl)/2] 



*These models are not considered in MCLUST. -t-These models are not considered in tEIGEN. 



Andrews and McNicholas (2012a) introduced the tEIGEN family, which comprises ^-analogues of the 
MCLUST models, along with two other GPCM models, with the added restriction of constraining or not 
constraining the component degrees of freedom Vi to be equal across groups. The tEIGEN family is supported 
by theteigen package (Andrews and McNicholas, 2012b) for R. The nomenclature for the tEIGEN models 
consist of four letters: 'C denotes that a constraint is imposed, 'U' denotes that a constraint is not imposed, 
and T denotes that the matrix in question is taken to be the identity matrix of suitable dimension. For each 
member of the tEIGEN family, relevant letters indicate the constraints on Ai,Dj, A^, and I'i, respectively. 

Herein, we consider parsimonious families of multivariate skew-normal and skew-f distributions that are 
non-elliptical, robust extensions to the GPCM family; we refer to our families as the SNCLUST and StCLUST 
families, respectively. Note that the geometric interpretation of the component shapes for members of the 
SNCLUST and StCLUST famihes is not the same as for members of the GPCM, MCLUST, and ffilGEN 
families unless the skewness is zero. In the GPCM family, the constraints (cf. Table 1) are applied to 
the decomposed component covariance matrices and so can be considered in terms of the volume, shape, 
and orientation of the component densities. However, for the SNCLUST and StCLUST families, we are 
decomposing the component scale matrices ilf, accordingly, the shapes of the corresponding component 
densities cannot be interpreted after the fashion of MCLUST (or GPCM) unless the skewness is zero (cf. 
Wang et al., 2009). This lack of geometric interpretability is of no importance because the rationale for 
imposing constraints on the decomposed component scale matrices herein is simply to introduce parsimony. 
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3.2 Parameter Estimation 

This section outlines the parameter estimation for the StCLUST family. Estimates for the SNCLUST family 
can be derived in a similar fashion. First, note that, from the representation given in (2), the joint distribution 
of y,u,w can be written 

f{y,u,w) = f{y\u,w)f{u\w)f{w) 

.CXp\-hy-^-X\u\y(-) \y-^-X\u\)] 



: cxp 



2(l/«;)/ r(|) ^12/ 



V^27r(TAu) 

('i^-)'^/2„,(i'+p-l)/2 

= (tWiniird) {-I [(^ - « - ^M)'n-(v - « - AM) + + .] } . 

Within the EM framework, it is convenient to adopt the following notation. We regard the observed 
data y = {y\, . . . , j/^) as being incomplete, whore it = (ui, . . . , u„) and w ~ (wi, . . . , w^n) are unobservable 
latent variables and the component membership labels z = (zi, . . . , z„) are the missing data. Note that z 
is defined such that Zj = {zij, . . . , Zgj)' for j = 1, . . . , n, where Zij = 1 if y^ belongs to component i and 
Zij = otherwise. Now, {y, u, w, z) is referred to as the complete-data and the complete-data log-likelihood 
for unconstrained component scale matrices ili,. . . ,flg is given by 

g n g n 

logjSf(* I y,u,w,z) = ^^ZijlogWi + ^^z,jlogfi{yj,Uj,Wj) =jSfi+^2+^3, (6) 

i=l j=l i=l j=l 

say, where 

g n 

ifi = ^^2;ijlog7ri, 

i=i j=i 

g n ^ 

-^2 = 5^ - 2 [log(27r) + log inr'l + Wj{y^ - - A,Uj)'f^r'(yj - - Kuj)] }, 

i=i i=i 

=^3 =EE^^i{ - 2 [(P- l)logK) + f K- - log('^i/2)] +logr(t/,/2) + {vi/2 - l)logK)}. 

i=l j=l 

In the E-step, the expected value of the complete-data log-likelihood can be found using any of the 
three methods mentioned in Section 2; we use the method of Vrbik and McNicholas (2012) in our analyses 
(Sections 4 and 5). The updates for TTj, ^j, Aj, and Vi needed in the M-step are given by Wang ct al. (2009); 
parameter estimation for the decomposed elements of r2j = AjDjAjD^ will depend on the model under 
consideration (cf. Table 1) and can be found in an analogous fashion to the Gaussian case. To acquire the 
updates for Aj, Dj, and Aj, we need only maximize ^2 in (6), which is equivalent to minimizing 

g n g n 

J2 E \^^~^\+ J2 E ^^j'^iiVj -ii- Kujyn-\yj - - XiUj) 

i=l j=l i=l j=l 

g n g n 

= E E + tr{ E E -^i- AiMj)'nr^(j/j - - A,Mj)| 

i—1 j — 1 i=l j=l 

9 n g 
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where Wi = J2^=i ^ij'^jiVj ~ ~ ^i'^j){yj ~ ~ ^Uj)' . Finally, define W := J2i=i ^^'^ ^^^^ ^^^^ 
parameter estimation for Ai, Dj, and Aj becomes identical to that carried out by Celeux and Govaert (1995). 

3.3 Model Selection 

The best model from amongst the ten members of the SfCLUST family is determined by the popular 
Bayesian information criterion (BIG; Schwarz, 1978). Althoiigh the underlying regularity conditions for the 
asymptotic approximation are not generally satisfied (cf. Kcribin, 2000), the BIG has shown to be a useful 
tool for selecting amongst mixture models (e.g., Fraley and Raftcry, 2002; Andrews and McNicholas, 2011a). 
The BIG is given by — 2^(x, t?) +TOlogn, where m is the number of free parameters, /(x, i?) is the maximized 
log-likelihood, and & is the maximum likelihood estimate of Dasgupta and Raftery (1998) propose using 
the BIG for mixture model selection. When defined as above, the model with the largest BIG is selected. 
Model selection for the SNGLUST family is handled in the same way. 

3.4 Performance Assessment 

Because the true classes for the data sets used in the applications that follow are known, the adjusted Rand 
index (ARI; Hubert and Arabic, 1985) is used to assess the clustering results. The ARI is a popular measure 
of classification agreement between the true and predicted group memberships or, more generally, between 
any two partitions. The ARI is a corrected form of the Rand index (Rand, 1971) that adjusts for chance 
agreement. An ARI of 1 corresponds to perfect agreement whereas an ARI of corresponds to results no 
better than would be expected by guessing. 

4 Simulation Studies 

4.1 Introduction 

Although the proposed methods can explicitly account for skewed components, they should also be capable 
of capturing symmetric components, e.g., data from a multivariate Gaussian or t-distribution. In Section 4.2, 
we illustrate that the StGLUST and SNGLUST families can uncover underlying Gaussian and t-mixturcs. 
We then demonstrate the difficulties encountered by the MGLUST and tEIGEN families when faced with 
skewed data (Section 4.3). The mclust and teigen packages are used to run the MGLUST and tEIGEN 
models, respectively. In Section 4.4, we present a more difficult data set in which we compare the four 
algorithms. Note that to facihtate direct comparison with MGLUST, we restrict the TEIGEN, SNGLUST, 
and StGLUST families to ten models to correspond to the MGLUSTcovariance structures (cf. Table 1) for 
all our simulations. 

4.2 Simulation 1 

For this simulation, the first component, of size iii = 300, was generated from a bivariate normal distribution 
with parameters 

^i=(o)' ^^=G i)' 

and the second component, of size n2 = 200, was generated from a bivariate i-distribution with v = 4 degrees 
of freedom and parameters 

The clustering performance of the StGLUST family was excellent, yielding an average ARI of 0.99 with 
standard deviation 0.007. The average fitted values for degrees of freedom were i>i = 56.92 and !>2 = 6.51 
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and, looking across all runs, sensible values were almost always selected (Figure 2). The average values 
for skewness were Ai = (0.021,-0.035) and A2 = (0.014,0.011), which makes sense when fitting elliptical 
components. 
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Figure 2: Histogram of the estimates for the degrees of freedom parameter from fitting the SfCLUST models 
on the 50 random runs for Simulation 1. 



The clustering performance of the SNCLUST family was also very good despite the fact that one of the 
components was simulated from a multivariate f-distribution. 

4.3 Simulation 2 

In this section, skewed data were generated from bivariate skew-t distributions using the convenient rep- 
resentation given in (2). In all, we used 50 random simulations with three components of sizes ni = 150, 
n2 = 200, and = 150, respectively, with locations = (0,0), ^2 = (6, 25), ^3 = (4,0), skewness vectors 
Ai = (2, 4), A2 = (—2, 4), A3 — (2, 4), degrees of freedom ui = 10, 1^2 = 8, 1/3 = 70, and scale matrices 

= ( ol oi)^ = ( 0^5 °f ) ' = ( V 0^ ) • 

Note that the third group has high degrees of freedom and is effectively a skew-normal cluster. All families 
were run for i = 1, . . . , 9 components; as a matter of practice, when the BIC chose a g = 9 component model, 
the algorithm was rerun for i — 1, ... ,12. 

When applied to these simulated data sets, the SiCLUST family of models gave excellent clustering 
performance, obtaining an of average ARI of 0.983 and correctly identifying the number of components 90% 
of the time (Table 2). MCLUST and iEIGEN, on the other hand, consistently overestimated the number 
of components and produced inferior ARI scores. The poor classification performance of the MCLUST and 
tEIGEN mixtures can sometimes be mitigated by merging components. For example, if we look at the 
results from a single run in Figure 3, we see that the classification performance of fEIGEN can be made 
equal to that of SiCLUST by merging components. However, merging the MCLUST components here will 
not make its classification performance equal to that of SfCLUST. The issue of merging is further considered 
in Sections 5.3 and 6.4. 
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Table 2: The number of components selected by MCLUST, tEIGEN, and SiCLUST for simulation 2, along 
with average ARI values and standard deviations. 



Groups Selected 


MCLUST 


tEIGEN 


StCLUST 


3 




18% 


90% 


4 


18% 


32% 


6% 


5 


18% 


18% 


4% 


6 


26% 


20% 




7 


30% 


12% 




8 


2% 






9 


4% 






10 


2% 






Mean ARI 


0.747 


0.749 


0.983 


Std. Dev. ARI 


0.116 


0.119 


0.040 



Results using MCLUST ieigen siclust 




Figure 3: The maximum a posteriori classification obtained from fitted MCLUST, tEIGEN, and SiCLUST 
models on one run of simulation 2. 



4.4 Simulation 3 

Both simulations in the previous sections represented well separated groups in two dimensions. To test 
these methods for a more complicated data set, a final simulation study was conducted using two overlap- 
ping groups in three dimensions. A two component multivariate skew-t data set was simulated with 100 
observations in each group. Typically, the overlapping groups created an X-shaped figure (e.g., Figure 4). 
One-hundred such random data sets were simulated and the results (Table 3) show that the SNCLUST 
and StCLUST models perform at least as well as the MCLUST and iEIGEN famihes. Notably, StCLUST 
selected a two-component model for 98% of the simulated data sets and obtained the highest average ARL 
Figure 5 plots the clustering results corresponding to simulated data in Figure 4. 

5 Applications 

5.1 Data Sets 

The efficacy of the StCLUST and SNCLUST models is demonstrated on four real data sets. The bank, crabs, 
iris, and wine data are frequently used benchmark data sets for testing the performance of various model- 
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Figure 4: A typical 3D scatterplot of the simulation as described in Section 4.4, where o represents an 
observation from simulated component 1 and * represents an observation from a simulated component 2. 



Table 3: The number of components selected by MCLUST, iEIGEN, and StCLUST for simulation 3, along 
with average ARI values and standard deviations. 



Groups Selected 


MCLUST 


iEIGEN 


SNCLUST 


StCLUST 


2 


6% 


64% 


60% 


98% 


3 


51% 


12% 


38% 


2% 


4 


35% 


16% 


2% 




5 


8% 


7% 






6 




1% 






Mean ARI 


0.482 


0.572 


0.606 


0.650 


Std. Dev. ARI 


0.106 


0.134 


0.090 


0.073 



based clustering algorithms. As in Section 4, we facilitate direct comparison with MCLUST by restricting 
the tEIGEN, SNCLUST, and StCLUST famihes to analogues of the ten MCLUST models for all analyses 
in Section 5. 

The Swiss Bank data (Flury and Riedwyl, 1988) records six different measurements on 100 genuine and 
100 counterfeit Swiss banknotes. The crabs data (Campbell and Mahon, 1974) contain the morphological 
measurements of frontal lobe size, rear width, carapace length, carapace width, and body depth. The mea- 
surements are taken in millimetres on 200 crabs, 50 from each combination of sex (male, female) and colour 
(blue, orange). The iris data (Anderson, 1935; Fisher, 1936) contain the length and width in centimetres 
of the sepal and petal of three species of irises. The data set comprises the measurements of 150 flowers, 
50 from each of the various species: setosa, versicolor, and virginica. The wine data (Forina et al., 1988) 
contain 13 chemical and physical properties of wine. The data set comprises measurements from 178 wines 
from three different cultivars; Barolo, Barbera, and Grignolino. 

5.2 Results 

The MCLUST, tEIGEN, SNCLUST, and StCLUST families — with the latter three restricted to the 
MCLUST models — were fitted to the the scaled data for i = 1, . . . , 9. Again, in the event that a g — 9 
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Figure 5: The solutions found by MCLUST, ffilGEN, SNCLUST, and StCLUST (top left, top right, bottom 
left, bottom right, respectively) when applied to the simulated data presented in Figure 4. 

component model was selected, models were rerun for i = 1, . . . , 12. The chosen models (in accordance with 
the BIG) are presented with their corresponding ARIs in Table 4; at least one of our two skewed families 
outperforms both MCLUST and tEIGEN in each case. 



Table 4: The ARI and members selected for each family when run on the bank, crabs, iris, and wine data 
sets. For each data set, the best ARI is highlighted in bold face. 









Model Selected, Number 


of Components, AR.I 






MCLUST 


tEIGEN 


SNCLUST 


StCLUST 


Bank 
Crabs 
Iris 
Wine 


EEE 4, 0.679 
EEE, 4, 0.311 
VVV, 2, 0.568 
VEI, 8, 0.481 


CCCC, 4, 0.679 
CCCC, 10, 0.508 
UUUC, 2, 0.568 
CICC, 7, 0.454 


EEE, 4, 0.606 
EEE, 4, 0.838 
VEI, 3, 0.941 
VEI, 4, 0.801 


EEE, 2, 0.980 
EEE, 5, 0.741 
VEI, 3, 0.922 
EVI, 3, 0.947 



Focusing first on the bank data set, we see comparable results between the MCLUST, iEIGEN, and 
SNCLUST families. However, a great improvement in ARI is achieved when the SiCLUST family is used. 
The superiority of StCLUST over the other three families is largely attributable to the fact that it selects 
the same number of components as true classes. The remaining families select a four-component model and 
so their performance might be improved via merging components (Section 5.3). 

The famous crabs data set is typically classified by sex and species (blue or orange), corresponding to four 
groups. Although the MCLUST family correctly selects a four-component model, the group memberships 
do not agree with this known partition and produced a low ARI (0.311). The iEIGEN family obtained a 
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higher ARI (0.508); however, it greatly overestimates the number of groups. SNCLUST, which selects a four- 
component model, performs much better than its competitors. This is not surprising because StCLUST is 
indicating skew-normal components (£>! = 90, V2 = 89.807, j>3 = 90, z>4 = 8.455, 1)5 = 90, i^e = 90) and the pa- 
rameter estimates suggest that the groups are negatively skewed; Ai — (—1.4, —1.2, —1.5, —1.6,-1.4)', A2 = 
(-1.9, -1.5, -2.0, -2.0, -1.9)', A3 = (-1.3, -1.7, -1.4, -1.5, -1.3)', A4 = (-1.3, -1.3, -1.2, -1.2, -1.2)'. 

For the iris data, the ARI obtained for both the MCLUST models and their t-analogucs arc comparable. 
Thus, little gain is achieved through the inclusion of the degrees of freedom parameter present in the tEIGEN 
and StCLUST models. The failure of the addition of the degrees of freedom parameter to improve on the 
clustering results here might be connected to its limitations as a onc-dimcnsional parameter. Had we used a 
j>dimensional degrees of freedom parameter, allowing different degrees of freedom in each dimension, some 
improvement may have been observed. This extension will be explored in future work. 

The results for the wine data set toll a similar story to that of the banknotes. As can be seen in Table 4, 
MCLUST and tEIGEN produce almost identical clustering results; however, a considerable improvement in 
ARI is apparent once skewness is introduced. As mentioned previously, the increased ARI is greatly owing 
to the fact that SfCLUST selects the value of g corresponding to the number of true classes. 

5.3 Comparing with Merged Clusters 

The recent work of Baudry et al. (2010) argues that the number of mixture components does not necessarily 
correspond to the number of true groups or clusters. For instances where an underlying group is comprised 
of a mixture of two or more Gaussian distributions, the foregoing paper proposes a method for combining 
mixture components to represent one cluster. This method first fits a mixture of Gaussian distributions with 
g components then successively merges mixture components together, resulting in a fc component solution 
where k < g. The recent version of mclust (i.e., version 4.0, Fraley et al., 2012) contains a function called 
clustCombi that implements the methodology proposed in Baudry et al. (2010). This section compares the 
clustering results produced by clustCombi with the results presented in Section 5.2. 

To compare our results with the optimal solution derived by merging components, we used two different 
merging techniques. The first involves the clustCombi function, which starts with the original g component 
solution (as chosen by the BIG) and combines two components according to an entropy criterion to obtain 
a, g — 1 component solution. This procedure is carried out until the one component solution is obtained. 
The ARI is calculated at each merging step and the solution with the largest ARI is saved as the 'best'. 
Hereafter we refer this result as the maximum clustcombi solution or MGG for short. The second method 
involves merging components by hand to create the most advantageous solution. When looking at a cross 
tabulation of the known group labels (or true clusters) against the clustering results obtained from the 
clustering algorithm, this becomes a matter of combining the columns that will maximize the agreements 
between group labels and components. An example of this merging process using cross-tabulations (or so- 
called classification tables) is demonstrated using the MCLUST solution for the bank dataset in Table 5. 
Herein, we shall refer to this second merging method as the merging by hand or MBH solution. Note that 
both the MCC and MBH procedures rely on knowing the true group membership labels; therefore, they 
could not be used for real clustering applications. 

Table 6 summarizes the ARI values for the pertinent models (indicated using gray font) after performing 
the MCC and MBH methods described above. Note that if the true number of clusters is greater than or 
equal to the number of components found by the clustering algorithm, no gain in ARI can be obtained by 
merging. For instance, when the clustering algorithms under consideration are applied to the iris data set 
(which has three true clusters) either a two or three component solution is obtained and no combination 
of component merging will result in a better classification. Because no merging is done, the results are 
unchanged from Table 4 and are given in black font in Table 6. Even when a merging technique is used, the 
SNCLUST and SiCLUST models perform as well as or better than the MCLUST and iEIGEN families in 
each case. 
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Table 5: Left: Classification results found by MCLUST cross-tabulated against the true group labels. Right: 
The merging by hand (MBH) classification results (created by merging the second with the third components 
found by MCLUST and similarly merging the first with the fourth) cross-tabulated against the true group 
labels. 



MCLUST MBH 
merge MCLUST 

True 1 2 3 4 components 2 and 3 True 1 

& components 1 and 4 

1 1 75 24 ^ 1 1 

2 15 85 2 100 



Table 6: ARI values for each data set after merging components is considered. The ARI values calculated 
after merging components (cf. Table 4) are given in grey font. Where merging cannot improve classification 

performance, the (unchanged) ARI values arc given in black font. For each ARI value, the associated number 
of components is given in brackets and for each data set, the best ARI value is typed in bold face. 











ARI 








MCC 






MBH 






MCLUST 


MCLUST 


tEIGEN 


SNCLUST 


StCLUST 


Bank 
Crabs 
Iris 
Wine 


0.311 (4) 
0.568 (2) 
0.876 (5) 


0.980 (2) 
0.679 (4) 
0.568 (2) 
0.901 (3) 


0.980 (2) 
0.783 (4) 
0.568 (2) 
0.867 (3) 


0.838 (4) 
0.941 (3) 
0.947 (3) 


0.980 (2) 
0.753 (4) 
0.922 (3) 

0.947 (3) 



6 Mo del- Based Classification 

6.1 The Model 

In the event that a subset of the data under consideration has known group membership labels, model-based 
classification can be used. This is a semi-supervised version of model-based clustering that uses both the 
labeled and unlabelled data to produce parameter estimates. If we order the data such that the first m 
observations (m < n) have known labels, then the Gaussian model-based classification likelihood is 

m g n h 

I xi,...,x„,zi,...,z™) = Jl J|[7ri(?!)(xj | /Lt^, Sj)]^'^' ^7rj0(xfe | 

j=l i=l fe=TO+l 1 = 1 

for h> g; often, as in our analyses (Sections 6.2, 6.3, and 6.4), it is assumed that h = g. 

We apply the SNCLUST and StCLUST families for model-based classification, drawing comparison with 
the MCLUST and fEIGEN families. In addition to the real data that we used to illustrate our clustering 
approaches, we analyze real food authenticity data on Italian olive oils. To facilitate direct comparison with 
MCLUST, we restrict the tEIGEN, SNCLUST, and StCLUST famihcs to analogues of the ten models used 
in MCLUST (cf. Table 1) for the analyses in Sections 6.2 and 6.3. However, we consider all 14 models (i.e., 
all 14 constraints for the decomposition of the component scale matrices) for the SNCLUST and StCLUST 
families in Section 6.4. 

6.2 Olive Oil Data 

The olive oil data (Forina and Tiscornia, 1982; Forina et al., 1983) contain the percentage of eight fatty acids 
found in 572 Italian olive oils. The data set are available within the pgmni package (McNicholas et al., 2011) 
for R. Each oil sample comes from one of three distinct regions, which can be further partitioned into nine 
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areas. McNicholas (2010) used these data to illustrate model-based classification, by taking random subsets 
of 50% of the observations to have known component membership labels; we will do the same to illustrate 
our models. 

We first consider the problem of classifying the olive oils into their regions. The four families were run 
using 15 different random subsets with 50% of the labels known and g = 3. As advocated by Andrews et al. 
(2011), we employ a uniform initialization whereby the imknown observations have initial predicted labels 
Zij = 1/g, for i = 1, . . . ,g and j = 1, . . . , n. A similar approach is repeated for the g = 9 component models, 
this time classifying the data by geographical area. The average ART values for the fitted models (Table 7) 
confirm the slightly superior performance of the SNCLUST family. However, we note that all four families 
give good classification performance when classifying by region and by area. 

We notice that the StCLUST models obtain a lower ARI than the tEIGEN models when classifying the 
olive oil into geographical area. One possible explanation is that uniform initialization is not suitable for the 
SiCLUST family. Because these procedures are susceptible to converge to local maxima, and considering 
the relatively complicated likelihoods associated with the skew-t family, perhaps a more guileful approach 
such as a deterministic annealing EM (cf. Ueda and Nakano, 1998) is necessary. Further investigation of this 
idea will be a subject of future work. 



Table 7: The average ARI values, with standard deviations, for the fitted MCLUST, ffilGEN, SNCLUST, 
and StCLUST models for the olive data by region {g = 3) and area {g = 9). For each data set, the best ARI 
is highlighted in bold face. 







MCLUST 


iEIGEN 


SNCLUST 


StCLUST 


Olive (Region) 
Olive (Area) 


0.9962 (0.0036) 
0.9091 (0.0333) 


0.9965(0.0032) 
0.9185 (0.0224) 


0.9983 (0.0028) 
0.9249 (0.0242) 


0.9946 (0.0021) 
0.8994 (0.0307) 



6.3 Other Data 

The same four data sets considered in Section 5 are used for model-based classification; again, 50% of the 
labels are taken to be known. Each data set was run with 25 different random subsets of known labels with 
the number of groups specified to the correct value. Although the MCLUST and tEIGEN families benefit 
substantially under a classification framework, SNCLUST obtained the highest ARI on all four data sets 
(Table 8). Comparing StCLUST with the its symmetric alternative tEIGEN, we see an improvement in ARI 
for all but the crabs dataset. As mentioned in Section 6.2, this inconsistency could be a result of the uniform 
initialization. 



Table 8: Average ARI values, with standard deviations, for the fitted MCLUST, fEIGEN, SNCLUST, and 
SiCLUST models. For each data set, the best ARI is highlighted in bold face. 





MCLUST 


tEIGEN 


SNCLUST 


StCLUST 


Bank 


0.9760 (0.0200) 


0.9760 (0.0200) 


0.9911 (0.0102) 


0.9792 (0.0204) 


Crabs 


0.8798 (0.0474) 


0.8818 (0.0451) 


0.9255 (0.0389) 


0.8575 (0.0632) 


Iris 


0.8910 (0.0311) 


0.8910 (0.0311) 


0.9447 (0.0305) 


0.9199 (0.0574) 


Wine 


0.9004 (0.0396) 


0.8271 (0.0505) 


0.9444 (0.0504) 


0.8379 (0.0770) 


6.4 


The EVV, EVE, 


VEE and WE models 







Heretofore, we restricted the SNCLUST and StCLUST families to correspond to the ten MCLUST models. 
This restriction was appropriate to facilitate direct comparison with MCLUST. In this section, we will 
consider the full SNCLUST and SiCLUST families; i.e., the SNCLUST and StCLUST famiUes with all 14 
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models (Table 1). We apply the full SNCLUST and SiCLUST families for model- based classification of the 
data sets considered in Sections 6.2 and 6.3. We compare the performance of the full families to the versions 
restricted to correspond to MCLUST. Note that wc implement the four 'extra' models for the SNCLUST 
and StCLUST families using the majorization-minimization (MM) algorithm (cf. Hunter and Lange, 2004) 
developed by Browne and McNicholas (2013). 

Table 9 presents ART scores associated with the reduced (i.e., MCLUST-analogous) as well as the full 
SNCLUST and StCLUST families. The last two columns summarize the proportion of runs where one of 
the four 'extra' models was chosen (these correspond to the EVV, EVE, VEE, and WE models in Table 1). 
Our results demonstrate that one of these models is often chosen, resulting in a higher average ART score 
for all but one of the StCLUST runs as well as half of the SNCLUST runs. To be more specific, EVE was 
selected between 59% and 100% of the time for all but the olive oil classification by region, where the EVE 
model was never selected. Wc therefore advocate the inclusion of all fourteen models in practice rather than 
the subset of ten models used by MCLUST. 



Table 9: Avera 


,ge ARI values, 


with standard deviations, for the fitted SNCLUST, 


and StCLUST 


using the 


ten model (orig 


;inal) and 14 model (full) families. For each data set, the best ARI is 


highlighted in 


bold face. 




Original 10 model family 


Full 14 model family 


% of non-MCLUST 




SNCLUST 


StCLUST 


SNCLUST 


StCLUST 


SNCLUST 


StCLUST 


Bank 


0.9911 (0.010) 


0.9792 (0.020) 


0.9908 (0.010) 


0.9896 (0.010) 


100% 


100% 


Crabs 


0.9255 (0.039) 


0.858 (0.063) 


0.9109 (0.028) 


0.9170 (0.033) 


93% 


92% 


Iris 


0.9447 (0.031) 


0.9199 (0.057) 


0.9622 (0.024) 


0.9658 (0.020) 


98% 


100% 


Wine 


0.9444 (0.050) 


0.8379 (0.077) 


0.9374 (0.046) 


0.9474 (0.030) 


64% 


85% 


Olive(Region) 


0.9981 (0.004) 


0.9965(0.003) 


0.9994 (0.002) 


0.9983 (0.002) 


100% 


67% 


Olive(Area) 


0.9091 (0.033) 


0.9185 (0.022) 


0.9655 (0.012) 


0.9562 (0.011) 


100% 


100% 



7 Concluding Remarks 

This paper builds on the growing trend towards non-Gaussian model-based clustering by developing two 
families of models that account for skcwness: skew-normal and skcw-t analogues of the GPCM family of 
models. Parameter estimation was outlined and our novel families were applied to simulated and real 
data. Both model-based clustering and classification were illustrated using several real data sets, and the 
performance of StCLUST and SNCLUST was generally superior to their symmetric analogues. Interestingly, 
this superior performance was often retained even after merging components was considered. 

Future work will focus on the initialization of these families in addition to the search for more effective 
model selection techniques. As briefly mentioned in Section 5.2, we could investigate the efficacy of extending 
the degrees of freedom to be p-dimensional. From the classification viewpoint, we only considered the semi- 
supervised model-based classification scenario, where one mixture component corresponds to a class. The 
straightforward extension to model-based discriminant analysis will be investigated and incorporated into R 
packages that are being developed for the StCLUST and SNCLUST families. Finally, the use of our skewed 
families in the analysis of data of mixed type (cf. Browne and McNicholas, 2012) will also be investigated. 
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